Abstract

High-resolution images can be used to resolve matching ambiguities between trajectory fragments (tracklets), which is one of the main challenges in multiple target tracking. A PTZ camera, which can pan, tilt and zoom, is a powerful and efficient tool that offers both close-up views and wide area coverage on demand. The wide-area view makes it possible to track many targets while the close-up view allows individuals to be identified from high-resolution images of their faces. A central component of a PTZ tracking system is a scheduling algorithm that determines which target to zoom in on.

In this paper we study this scheduling problem from a theoretical perspective, where the high-resolution images are also used for tracklet matching. We propose a novel data structure, the Multi-Strand Tracking Graph (MSG), which represents the set of tracklets computed by a tracker and the possible associations between them. The MSG allows efficient scheduling as well as resolving - directly or by elimination - matching ambiguities between tracklets. The main feature of the MSG is the auxiliary data saved in each vertex, which allows efficient computation while avoiding time-consuming graph traversal. Synthetic data simulations are used to evaluate our scheduling algorithm and to demonstrate its superiority over a naive one.


1. Introduction 

We consider a system consisting of a single PTZ camera (which can pan, tilt and zoom) to solve the problem of tracking multiple pedestrians while also capturing their faces. A necessary component of such a system is a scheduling algorithm that determines at any time step whether to remain in zoom-out mode or to zoom in on a face. This paper presents an efficient new data structure, the Multi-Strand Tracking Graph (MSG), for multiple target tracking using a single PTZ system, and a novel scheduling algorithm based on it.

Our method aims to overcome one of the main challenges of a multiple target tracker: trajectory fragmentation. Such fragmentation is caused by occlusions, by the joining of two or more targets who then split (e.g., Figures 1(a), 2(a)), or when the PTZ camera zooms in on another target, creating what we call a blind gap (Figure 1(b)). Matching trajectory fragments (tracklets) is complicated due to ambiguities caused by similarity in appearance and location of different targets. We propose using the faces captured in zoom-in mode, together with the available information of the system state, to resolve such ambiguities.

The objective of the proposed system is to maximize the total length of the labeled tracklets, that is, trajectory fragments with associated high-resolution face images captured in zoom-in mode. Formally, let $Z$ be the set of targets, and for each $z \in Z$ let $\{\tau_i(z)\}_i$ be the set of its detected tracklets. For each target $z \in Z$ we define $tl(z)$ to be the union of the labeled tracklets of $z$. The objective is given by

$$
M = \frac{\sum_{z \in Z} |tl(z)|}{\sum_{z \in Z} \sum_{i} |\tau_i(z)|}, \tag{1}
$$

where $|\cdot|$ denotes the tracklet length. Our scheduler selects the target with the highest probability to maximize $M$ to be the next target that will be zoomed in on. Note that resolving ambiguities (e.g., matching $v_2$ and $v_9$ in Figure 2(a)) can greatly increase $M$.
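As a minimal sketch of Eq. 1, the snippet below computes $M$ from a flat list of tracklets. The function name `objective_m` and the tuple layout are illustrative assumptions, not part of the original system; it also assumes that the labeled tracklets of a target do not overlap in time, so the length of their union equals the sum of their lengths.

```python
def objective_m(tracklets):
    """Compute the objective M of Eq. 1.

    tracklets: iterable of (target_id, length, is_labeled) tuples
               (hypothetical layout chosen for this sketch).
    Assumes labeled tracklets of one target are disjoint in time, so the
    length of their union tl(z) is the sum of their lengths.
    """
    tracklets = list(tracklets)
    total = sum(length for _, length, _ in tracklets)
    labeled = sum(length for _, length, is_labeled in tracklets if is_labeled)
    # With no tracklets at all, return 0.0 rather than dividing by zero.
    return labeled / total if total > 0 else 0.0


# Target "a" has a labeled tracklet of length 10 and an unlabeled one of
# length 5; target "b" has one labeled tracklet of length 5.
example = [("a", 10, True), ("a", 5, False), ("b", 5, True)]
m_value = objective_m(example)
```

In this example 15 of the 20 tracked time units are labeled; matching the unlabeled fragment of "a" to its labeled one would raise the ratio to 1.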

We introduce the Multi-Strand Tracking Graph (MSG), which represents the tracklets computed by a tracker and their possible associations (its basic structure is similar to [14]). We show that a straightforward use of the graph for the abovementioned task requires a graph traversal. The main contribution of this paper is the proposed auxiliary data stored in each vertex. We use this data to efficiently compute the system state information without traversing the graph. The graph is constructed online and the auxiliary data is recursively computed based only on the vertex itself and on its direct parents. Hence, all the required information is available when scheduling decisions are made. Other contributions of this paper are the use of high-resolution images to resolve matching ambiguities of tracklets and the design of an efficient scheduling algorithm that uses the MSG.

Figure 1. (a) Two targets walk separately, join and then split. (b) A blind-gap scene and its MSG: targets who walk separately but move out of sight when the camera zooms in on another target, and then become visible again in the next zoom-out mode. Circle nodes represent solo vertices. A diamond node represents a compound vertex.

Figure 2. (a) Three targets walk in a scene, with two join and split events. (b) The corresponding MSG. (c) First untangling step after matching $v_9$ to $v_2$ using high-resolution images. (d) The final MSG after untangling: each solo vertex represents a full target trajectory. Circle nodes represent solo vertices, and diamond nodes represent compound vertices.

System overview: The tracking system considered in this paper consists of a single PTZ camera, and several components, described below, are assumed to be available. These include a tracker that detects and tracks pedestrians in zoom-out mode. It also detects joining and splitting events of two or more targets moving together (as in [14]). According to the proposed scheduler, the system selects a person to zoom in on using a camera control algorithm. The control algorithm chooses the FOV that makes it possible to zoom in on the selected target (e.g., [2]). In the zoom-in mode, a face image is acquired and face-to-face and face-to-person matchings are computed. The system then zooms out, to the same wide view, to continue tracking. A person-to-person matching module associates tracklets when returning from zoom-in mode or after targets split from a group. Figure 3 summarizes the system components. Our contribution to the system is the graph representation (MSG) and the efficient scheduling algorithm.

2. Previous Work 

Scheduling of a single PTZ camera was considered in [13, 2, 11, 12]. Scenarios of joining/splitting targets were considered in [13, 2]. The goal of [13] was to minimize the slew time of an aerial camera tracking cars. High-resolution images were used to remove incorrect prediction hypotheses (stored as a tree). The greedy policy in [2] aimed to maximize the number of captured faces, considering the predicted time of each target to exit the scene and its movement angle w.r.t. the camera. An information-theoretic approach [11, 12] aims to decrease location uncertainty while capturing high-resolution images. A distributed game-theoretic approach for scheduling multiple PTZ cameras [7] aims to maximize the targets’ image quality and to capture their faces. None of the above scheduling algorithms considered the goal of resolving tracklet-matching ambiguities.

Other systems considered setups with both fixed and PTZ cameras, in a master-slave configuration. Such setups are less challenging since a fixed camera continuously views the entire region. They vary from a single master and a single slave [1, 3] to multiple masters and multiple slaves [4, 6, 10, 16]. The objectives in these studies are to acquire once [4], or as many times as possible [3], the face of each target, or to minimize camera motion [1]. The scheduling methods consider the expected distance from the camera [10], the viewing angle [1, 6, 13, 16], and expected occlusions [6, 13]. In addition to these objectives, our algorithm also considers how zooming in contributes to the resolution of past and future ambiguities of tracklet matching.

Graphs were previously used to represent relations between tracklets [5, 9, 15, 17, 18], where the weighted edges reflect the appearance similarity and the consistency of location with respect to the computed motion direction and sometimes speed. A graph with a similar structure to the MSG [5, 8, 14] was used to associate isolated tracklets of targets with indistinct appearance as well as tracklets of a set of targets that cannot be separated. The joins/splits of targets were computed by a tracker. The association of single target tracklets is solved by finding the most probable set of paths. All these papers use the target’s location and only one appearance descriptor level for matching while we use both low- and high-resolution images. Moreover, they do not use auxiliary data, which allows efficient scheduling and online graph updating in our method.

Figure 3. An overview of the system and its operation modes.


3. Method 

We first describe the basic structure of the MSG. Next, we extend the MSG with auxiliary data for efficient matching by elimination. Finally, our scheduling algorithm is presented.

3.1. Graph Definition 

The basic structure of the MSG is a dynamic augmented graph, $G = (V, E)$, where $V$ represents the set of tracklets computed by the tracker, and $E$ the candidate associations of different tracklets computed by some available matching algorithm (similar to [14]). Each vertex is associated with the information regarding its tracklet. We consider two types of vertices that represent two types of tracklets. A solo vertex represents the tracklet of a single target and a compound vertex represents the shared tracklet of joined targets, that is, a set of targets that walk together (Figure 2(b)).

A directed edge, $e = (v_i, v_j) \in E$, represents the case in which at least one of the targets associated with $v_i$ may also be associated with $v_j$, and $v_i$ and $v_j$ correspond to consecutive time intervals (ignoring zoom-in time). A compound vertex is generated as the child of other vertices when the tracker detects that the targets’ trajectories are joined into indistinguishable tracklets (see Figures 2(a), 2(b)). A new solo vertex is generated when a new target enters the scene, a target trajectory splits from those of others (as a child of the compound vertex), or a target reappears when the camera returns to zoom-out (after a blind gap).

Edges between solo vertices are generated at consecutive layers (e.g., when returning from zoom-in mode), according to a matching algorithm that is based on the targets’ low-resolution images captured in zoom-out mode and on their locations. When the matching of a new solo vertex is ambiguous, edges are set between the vertex and all the matching candidates, forming an X-type ambiguity (see Figure 1(b)). Additional edges are generated between compound and solo vertices, based on the tracker’s detection of splitting and joining targets.
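The graph definition above can be sketched as a small container class. This is only an illustration under stated assumptions: the names `Vertex`, `MSG`, `add_vertex`, `add_edge` and `sources`, and the choice of storing parent and child lists in each vertex, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """One tracklet. kind is "solo" (single target) or "compound" (joined targets)."""
    vid: int
    kind: str
    length: float = 0.0          # tracklet length
    labeled: bool = False        # True once a face image is associated with it
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)


class MSG:
    """Minimal directed-graph container for tracklets and candidate associations."""

    def __init__(self):
        self.vertices = {}

    def add_vertex(self, vid, kind, length=0.0):
        if kind not in ("solo", "compound"):
            raise ValueError("kind must be 'solo' or 'compound'")
        self.vertices[vid] = Vertex(vid, kind, length)
        return self.vertices[vid]

    def add_edge(self, u, v):
        """Directed edge u -> v: a target of u may continue as a target of v."""
        self.vertices[u].children.append(v)
        self.vertices[v].parents.append(u)

    def sources(self):
        """Vertices without parents (targets entering the scene)."""
        return [vid for vid, vert in self.vertices.items() if not vert.parents]


# A join-and-split scene in the spirit of Figure 1(a): two solo tracklets
# join into one compound tracklet, which later splits into two solo tracklets.
msg = MSG()
for vid in (1, 2, 4, 5):
    msg.add_vertex(vid, "solo")
msg.add_vertex(3, "compound")
for u, v in [(1, 3), (2, 3), (3, 4), (3, 5)]:
    msg.add_edge(u, v)
```

Storing the parent list in each vertex is what later allows per-vertex auxiliary data to be updated from the direct parents only.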

3.2. Untangling 

When no ambiguities are present, the trajectory of each target can be fully recovered, and the graph contains only unconnected solo vertices. We wish to reduce as much as possible the number of vertices by concatenating consecutive tracklets to a single tracklet when possible. When a univocal matching exists for a consecutive set of tracklets (all of the same target), their corresponding vertices form a solo chain in the graph - in which the out degree and the in degree of a vertex’s parent and child, respectively, are one (Figure 2(c)). A solo chain, of any length, can be merged to a single solo vertex (Figure 2(d)). A compound chain can be defined and merged similarly.

A univocal matching of a pair of non-consecutive solo vertices, $\{u, v_L\}$, can be computed using an available face-to-face matching algorithm. In addition, an indirect match can also be obtained by elimination (see Section 3.3).

Such a univocal matching of $\{u, v_L\}$ can be used for untangling the graph as long as the connected component of $u$ and $v_L$ is a DAG (a directed graph that does not contain cycles). In this case, a breadth-first-search (BFS) algorithm is used to recover the graph path, $\ell(u, v_L)$ (e.g., $\ell(v_2, v_9)$ in Figure 2(b)). All vertices of $\ell(u, v_L)$ are guaranteed to represent consecutive tracklets of the same target. Hence, edges ‘to’ and ‘from’ solo vertices of $\ell(u, v_L) \setminus u$ (representing X-type ambiguities) are removed except those that are part of the path. Each compound vertex $v_{comp} \in \ell(u, v_L)$ is split into two vertices. One solo vertex represents only the labeled target and is linked only to the solo chain. The second vertex, $v_{split}$, represents the remaining targets of the compound vertex and is disconnected from the chain (Figure 2(c)). As a result, a solo chain of the labeled target, and possibly additional solo chains of other targets, are obtained. Each chain can be merged into a single solo vertex (Figure 2(d)). Note that no information is lost in the untangling process.
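The path-recovery step of untangling can be sketched with a standard BFS over child lists. The function name `recover_path` and the adjacency-dictionary representation are assumptions made for this illustration; the sketch returns one directed path from $u$ to $v_L$ and does not perform the subsequent edge removal or compound-vertex splitting.

```python
from collections import deque


def recover_path(children, u, v):
    """Return a directed path [u, ..., v] found by BFS, or None if v is unreachable.

    children: dict mapping a vertex id to the list of its child vertex ids
              (hypothetical representation of the MSG edges).
    """
    prev = {u: None}            # predecessor of each visited vertex
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            # Walk the predecessor links back to u and reverse them.
            path = []
            while x is not None:
                path.append(x)
                x = prev[x]
            return path[::-1]
        for c in children.get(x, ()):
            if c not in prev:
                prev[c] = x
                queue.append(c)
    return None


# Solo vertices 1 and 2 join into compound vertex 3, which splits into 4 and 5.
edges = {1: [3], 2: [3], 3: [4, 5]}
path_1_to_5 = recover_path(edges, 1, 5)
```

If face-to-face matching establishes that vertices 1 and 5 show the same target, the recovered path passes through the compound vertex 3, which is the vertex that would then be split.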

3.3. Matching by Elimination 

When there is sufficient confidence that a labeled vertex cannot be matched to any of the previously labeled vertices, the vertex can sometimes be indirectly matched to an unlabeled vertex by elimination. For example, assume that $v_1$, $v_2$ and $v_9$ in Figure 2(b) were labeled and no match was found for the faces of either $v_1$ or $v_2$ with $v_9$. It is possible to deduce that $v_3$ is the correct match to $v_9$. Similarly, if only $v_3$ and $v_9$ were labeled, the non-source $v_5$ is deduced to be this match. We next define when an indirect match can be found in the general case, and how to compute it efficiently.

Let $V_L$ be the set of labeled solo vertices. We define an unlabeled path between $w$ and $v$, $\ell(w, v)$, to be a path that does not contain any labeled vertex except possibly $w$ and $v$, that is, $\forall u \in \ell(w, v) \setminus \{w, v\}$, $u \notin V_L$.

Claim 1: Sufficient and necessary conditions for a solo vertex $w \notin V_L$ to be an indirect match to $v \in V_L$ are (i) $v$ cannot be matched to a previously labeled vertex; (ii) there exists an unlabeled path, $\ell(w, v)$; (iii) if an unlabeled solo vertex $w'$ satisfies (ii) then $w' \in \ell(w, v)$.

Proof: We begin by proving that (i)-(iii) are necessary conditions. Assume $w$ is an indirect match of $v_L$. Then (i) must hold since otherwise $v_L$ can be directly matched; (ii) must hold since otherwise either $\ell(w, v_L)$ does not exist and hence no match between $w$ and $v_L$ is possible, or $\exists w' \in \ell(w, v_L)$, where $w'$ is a labeled solo vertex. However, an indirect match of $w$ and $v_L$ implies a match between all solo vertices $u \in \ell(w, v_L)$ and $v_L$. Hence, $v_L$ could be directly matched to $w'$, which contradicts condition (i). Finally, (iii) must hold since otherwise there exists $w' \notin \ell(w, v_L)$ that satisfies (ii). It follows that more than one feasible indirect match to $v_L$ exists. Hence, there is insufficient information to determine which of them is the correct one, and an indirect match of $w$ and $v_L$ cannot be determined.

We next prove that if conditions (i)-(iii) hold, then $w$ is an indirect match of $v_L$. From condition (i) it follows directly that $w$ cannot be directly matched to $v_L$. From condition (ii) it follows that $\ell(w, v_L)$ exists; hence $w$ is a possible match. It is left to show that $w$ is the only feasible match. From condition (iii) it follows that $w$ is the only feasible match to $v_L$ since any other match, $w'$, satisfies $w' \in \ell(w, v_L)$.


When (i) holds, an indirect match to a labeled solo vertex $v_L$ can be computed in a straightforward manner by traversing the graph backwards from $v_L$ and checking whether a vertex $w$ that satisfies (ii) and (iii) exists. This is clearly time consuming. Instead, we propose to store auxiliary data in each vertex; this data, which can be efficiently computed online from the vertex itself and its parents, makes it possible to directly compute an indirect match, if one exists. We will also use this data later for scheduling.

Auxiliary data for matching by elimination: We define $w$ to be an origin of $v$ if (i) $w$ is a solo vertex; (ii) there exists an unlabeled path $\ell(w, v)$; and (iii) $w$ is either a source of the graph (unlabeled origin) or a labeled vertex (labeled origin). A labeled vertex is the origin of itself and has no unlabeled origins. The set of origins of $v$ consists of the set of vertices - each associated with a distinct target ID - that may represent the same target as $v$. Note that only a labeled origin of $v$ may be directly matched to $v$.

We observe that a solo vertex $v$ may have an indirect match only if it has at least one unlabeled origin (otherwise it can only be directly matched). Furthermore, $v$ may have an indirect match only if just one of its parents has unlabeled origins (otherwise, the unlabeled origins, one from each parent, do not satisfy (iii) of Claim 1). Hence, to compute whether an indirect match exists, it is sufficient to store in each vertex the number of its unlabeled origins, denoted by $n_{\neg L}(v)$, and the single parent that has unlabeled origins, if one exists, $p_{\leftarrow}(v)$ (set to zero if one does not exist). Let $P(v)$ be the set of parents of $v$. Both $n_{\neg L}(v)$ (given by summing the number of unlabeled origins of $P(v)$) and $p_{\leftarrow}(v)$ can be recursively defined as follows:


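$$
n_{\neg L}(v) =
\begin{cases}
1 & v \notin V_L \ \&\ |P(v)| = 0 \\
0 & v \in V_L \\
\sum_{v_i \in P(v)} n_{\neg L}(v_i) & \text{otherwise.}
\end{cases}
\tag{2}
$$

$$
p_{\leftarrow}(v) =
\begin{cases}
u & \exists!\, u \mid u \in P(v) \ \&\ n_{\neg L}(u) > 0 \\
0 & \text{otherwise.}
\end{cases}
\tag{3}
$$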

Note that if a vertex $w$ is the indirect match of a solo vertex $p_{\leftarrow}(v)$, it is also the indirect match of $v$. Therefore, we can efficiently and recursively compute the single candidate of an indirect match of $v$, $C(v)$:

$$
C(v) =
\begin{cases}
0 & n_{\neg L}(v) = 0 \\
C(p_{\leftarrow}(v)) & p_{\leftarrow}(v) \neq 0 \ \&\ C(p_{\leftarrow}(v)) \neq 0 \\
solo(v) & \text{otherwise,}
\end{cases}
\tag{4}
$$

where $solo(v)$ holds $v$ if $v$ is a solo vertex and 0 otherwise.

Note that if $v$ is a compound vertex, it cannot have an indirect match; however, the value $C(v)$ contains the candidate indirect match for its descendants. After labeling $v$ and untangling the MSG (if such untangling is possible), the auxiliary data is recalculated to be $n_{\neg L}(v) = 0$. This reflects that ambiguities of this target prior to the labeling are no longer relevant for future ambiguities. After each labeling and once the untangling is complete, $C$ must also be recalculated for all the vertices that were disconnected from the solo chain during the untangling process. Each of these vertices then propagates the updated value to all its descendants, who recalculate their own values accordingly.
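The recursions of Eqs. 2-4 can be sketched as a single pass over the vertices in topological order. The function name `compute_aux`, the dictionary layout, and the use of `None` in place of the paper's value 0 for "no vertex" are assumptions of this sketch. Each vertex is processed using only its own fields and the stored values of its direct parents, which is the property that avoids graph traversal.

```python
def compute_aux(vertices, order):
    """Compute n_notL (Eq. 2), p_arrow (Eq. 3) and the candidate C (Eq. 4).

    vertices: dict vid -> {"parents": [...], "labeled": bool, "solo": bool}
              (hypothetical layout for this sketch).
    order:    vertex ids in topological order, parents before children.
    None stands for the paper's value 0 ("no vertex").
    """
    n_notl, p_arrow, cand = {}, {}, {}
    for v in order:
        info = vertices[v]
        parents = info["parents"]

        # Eq. 2: number of unlabeled origins.
        if info["labeled"]:
            n_notl[v] = 0
        elif not parents:
            n_notl[v] = 1
        else:
            n_notl[v] = sum(n_notl[p] for p in parents)

        # Eq. 3: the unique parent that has unlabeled origins, if any.
        with_unlabeled = [p for p in parents if n_notl[p] > 0]
        p_arrow[v] = with_unlabeled[0] if len(with_unlabeled) == 1 else None

        # Eq. 4: single candidate for an indirect match.
        if n_notl[v] == 0:
            cand[v] = None
        elif p_arrow[v] is not None and cand[p_arrow[v]] is not None:
            cand[v] = cand[p_arrow[v]]
        else:
            cand[v] = v if info["solo"] else None
    return n_notl, p_arrow, cand


# Labeled source 1 and unlabeled source 2 join into compound vertex 3,
# which splits into the unlabeled solo vertices 4 and 5.
graph = {
    1: {"parents": [], "labeled": True, "solo": True},
    2: {"parents": [], "labeled": False, "solo": True},
    3: {"parents": [1, 2], "labeled": False, "solo": False},
    4: {"parents": [3], "labeled": False, "solo": True},
    5: {"parents": [3], "labeled": False, "solo": True},
}
n_notl, p_arrow, cand = compute_aux(graph, [1, 2, 3, 4, 5])
```

In this toy graph the stored candidate of vertex 4 is vertex 2: if vertex 4 is later labeled and its face does not match that of vertex 1, vertex 2 is the only remaining origin and can be read off directly from the stored value.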


3.4. Scheduling 

The scheduler selects the tracklet whose target’s face will be acquired in the next zoom-in mode or selects to stay in zoom-out mode. The score it provides reflects the expected contribution of a tracklet labeling to maximizing the system’s objective, $M$ (Eq. 1). A prerequisite for choosing to zoom in on an unlabeled target is that the acquisition of its face is expected to be successful. A Boolean value that indicates the expected success, $E_s(v)$, can be computed by the tracker in a similar manner to previous studies (e.g., [2]). For example, the motion direction can be used for predicting occlusions and time to exit, and whether the face will be visible to the camera.

Figure 4. Screen shots of simulation A. (a) Trajectory trails of 3 targets, using (X,Y,t) coordinates. (b) The final MSG obtained by our method. (c) The final MSG obtained by our method when untangling is not used. (d) The final MSG of the naive scheduler. Asterisk: a labeled vertex.

Ideally, the minimal number of required labelings for a full trajectory retrieval of $N$ targets is $N$, one per target. The upper bound of the required number of ideal labelings is $\sum_i (2n_i - 1) \le 2N - 1$, where $n_i = |G_i|$ and $G_i \subseteq G$ is a connected component. This sum includes the labeling of the first and last solo-walking tracklet of each target $z$. Thus, the full trajectory of $z$ is recovered and labeled (under the no-cycle assumption). Note that each untangling may further reduce the required number of labelings.

In practice, an ideal labeling set is often impossible to obtain: the online algorithm leaves limited time for zooming in, and each labeling may cause additional ambiguities due to a blind gap. Moreover, the target identity and hence its contribution to $M$ is unknown before zooming in.

Therefore, we propose a scheduling algorithm that approximates the estimated contribution of labeling each of the targets or staying in zoom-out mode to maximize $M$. A zoom-out score, $S_{zo}$, can reflect global properties of the scene, such as the number of new targets expected to enter it, and the prevention of X-type ambiguities caused by a blind gap in zoom-in mode. Here we set it to be a constant. A labeling score, $S_L(v)$, is set for each vertex and reflects the expected contribution to $M$ if $v$ is chosen to be labeled. The $v$ to be selected for labeling is the one with the highest $S_L$ as long as $S_L(v) > S_{zo}$. Otherwise, the system remains in zoom-out mode.
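The decision rule above amounts to an argmax with a threshold. A minimal sketch, assuming the labeling scores have already been computed and are passed in as a dictionary (the function name `choose_action` and the use of `None` for "stay in zoom-out mode" are illustrative choices):

```python
def choose_action(label_scores, s_zo):
    """Pick the vertex to zoom in on, or None to remain in zoom-out mode.

    label_scores: dict vertex_id -> S_L(v) for the candidate vertices.
    s_zo:         the zoom-out score (a constant in the paper's experiments).
    """
    if not label_scores:
        return None
    best = max(label_scores, key=label_scores.get)
    # Zoom in only if the best labeling score strictly exceeds S_zo.
    return best if label_scores[best] > s_zo else None


decision = choose_action({7: 0.2, 9: 0.7}, s_zo=0.5)
```

Here vertex 9 is selected because its score exceeds the zoom-out score; with scores all at or below `s_zo` the function returns `None` and the camera keeps the wide view.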

3.5. Labeling Score & Auxiliary Data 

The score $S_L(v)$ is a weighted sum of two terms that estimate the expected resolution of future ($S_F$) and past ($S_P$) ambiguities:

$$
S_L(v) = E_s(v) \left( \alpha(v) S_F(v) + \beta(v) S_P(v) \right), \tag{5}
$$

where the weights $\alpha(v)$ and $\beta(v)$ are higher for source and expected sink vertices, respectively.


Future ambiguities: The probability that $v$ was not labeled before is given by $n_{\neg L}(v)/n_o(v)$, where $n_o(v)$ and $n_{\neg L}(v)$ are the number of origins and unlabeled origins of $v$, respectively. The score $S_F(v)$ is defined to be $S_F(v) = Join(v)\, n_{\neg L}(v)/n_o(v)$, where $Join(v)$ is a Boolean value that reflects that the target of $v$ is expected to join another target with a similar appearance (computed by the tracker). The value of $n_{\neg L}(v)$ can be recursively computed (see Eq. 2). In a similar manner, $n_o(v)$ can also be recursively computed (as specified in Appendix A).
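A small numeric sketch of the future-ambiguity term and of the weighted sum in Eq. 5 follows. The function names and the convention of returning 0 when a vertex has no origins are assumptions of this illustration; the past-ambiguity score $S_P(v)$ is taken as an input here.

```python
def future_score(join_expected, n_notl, n_o):
    """S_F(v) = Join(v) * n_notL(v) / n_o(v).

    join_expected: Boolean Join(v) supplied by the tracker.
    n_notl, n_o:   number of unlabeled origins and of all origins of v.
    Returns 0.0 when n_o is 0 (a convention assumed for this sketch).
    """
    if not join_expected or n_o == 0:
        return 0.0
    return n_notl / n_o


def labeling_score(face_expected, alpha, beta, s_f, s_p):
    """S_L(v) = E_s(v) * (alpha(v) * S_F(v) + beta(v) * S_P(v)), Eq. 5."""
    if not face_expected:
        return 0.0
    return alpha * s_f + beta * s_p


# A vertex with two origins, one of them unlabeled, expected to join another
# target; the weights and the past score are made-up illustrative numbers.
s_f = future_score(True, n_notl=1, n_o=2)
s_l = labeling_score(True, alpha=2.0, beta=1.0, s_f=s_f, s_p=3.0)
```

Because $E_s(v)$ is Boolean, a target whose face is not expected to be captured gets a score of 0 and can never win over a positive zoom-out score.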

Past ambiguities: The score $S_P(v)$ reflects the expected increase in the length of the labeled tracklets, $L = \sum_{z \in Z} |tl(z)| = \sum_{v \in V_L} |\tau(v)|$, if $v$ is chosen for labeling. Labeling a vertex $v$ increases $L$ by the length of the tracklet $\tau(v)$. In addition, if $v$ is matched to $u$, directly or indirectly, then $L$ is extended by the sum of $|\tau(w)|$ over all $w \in \ell(u, v)$ such that $w \notin V_L$. The identity of $v$’s target, and hence the origin to which $v$ will be matched, is unknown prior to its labeling. Hence, we average over all possible increases of $L$ with respect to the $n_o(v)$ possible origins that may be the match of $v$:

$$
S_P(v) = \frac{1}{n_o(v)} \left( \Delta L_{dir}(v) + L_{\neg dir}(v) \right), \tag{6}
$$

where $L_{\neg dir}(v)$ and $\Delta L_{dir}(v)$ sum the increase of $L$ over the sets of unlabeled origins and labeled origins, respectively.

A straightforward computation of $S_P(v)$ is by graph traversal. To avoid such a computationally expensive operation for each candidate vertex, we store in each vertex the auxiliary data fields $\Delta L_{dir}(v)$, $L_{\neg dir}(v)$, and $n_o(v)$. These values are recursively computed. We next describe the computation of $L_{\neg dir}(v)$. (For $\Delta L_{dir}(v)$, see Appendix A.)

Figure 5. Screen shots of simulation B. (a) Trajectory trails of all the targets that entered or already exited the scene, using (X,Y,t) coordinates. Solid lines: labeled tracklets; dashed lines: unlabeled tracklets; black triangle: a face acquisition; asterisk: a labeled target. (b-f) Our scheduler’s MSG after: (b) four targets are labeled, two of which before joining other targets; (c) a direct match event of the red vertices; (d) the following untangling; (e) an indirect match event of the green vertices; (f) the following untangling. (g) The final MSG obtained by our method. (h) The final MSG of the naive scheduler. Asterisk: a labeled vertex.

When no match is found (either direct or indirect), the contribution of labeling $v$ to $L_{\neg dir}(v)$ is $|\tau(v)|$ for each unlabeled origin. If an indirect candidate match is given by $C(v)$, that is, $C(v) \notin \{0, v\}$, then the additional contribution of labeling $v$ is given by $|\tau(\ell(C(v), p_{\leftarrow}(v)))|$. To compute it, $L_{\neg dir}(p_{\leftarrow}(v))$ is computed recursively as follows:

$$
L_{\neg dir}(v) =
\begin{cases}
|\tau(v)| \cdot n_{\neg L}(v) + L_{\neg dir}(p_{\leftarrow}(v)) & v \notin V_L \ \&\ p_{\leftarrow}(v) \neq 0 \\
|\tau(v)| \cdot n_{\neg L}(v) & v \notin V_L \ \&\ p_{\leftarrow}(v) = 0 \\
0 & v \in V_L.
\end{cases}
\tag{7}
$$
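The recursion of Eq. 7 can be sketched as a memoized function over per-vertex dictionaries. The function name `l_notdir`, the dictionary arguments, and the use of `None` for the paper's "0" value of $p_{\leftarrow}(v)$ are assumptions of this sketch; in an online system the value would instead be stored in each vertex when it is created.

```python
def l_notdir(v, length, labeled, n_notl, p_arrow, memo=None):
    """Evaluate L_notdir(v) following the three cases of Eq. 7.

    length:  dict vid -> tracklet length |tau(v)|
    labeled: dict vid -> True if v is in V_L
    n_notl:  dict vid -> number of unlabeled origins (Eq. 2)
    p_arrow: dict vid -> unique parent with unlabeled origins, or None (Eq. 3)
    memo:    cache so that each vertex is evaluated once
    """
    if memo is None:
        memo = {}
    if v in memo:
        return memo[v]
    if labeled[v]:
        value = 0.0
    else:
        value = length[v] * n_notl[v]
        if p_arrow[v] is not None:
            value += l_notdir(p_arrow[v], length, labeled, n_notl, p_arrow, memo)
    memo[v] = value
    return value


# Unlabeled source 2 (length 4) feeds compound vertex 3 (length 6), which
# feeds solo vertex 4 (length 5); vertex 1 is a labeled source.
length = {1: 3.0, 2: 4.0, 3: 6.0, 4: 5.0}
labeled = {1: True, 2: False, 3: False, 4: False}
n_notl = {1: 0, 2: 1, 3: 1, 4: 1}
p_arrow = {1: None, 2: None, 3: 2, 4: 3}
gain_if_unmatched = l_notdir(4, length, labeled, n_notl, p_arrow)
```

For vertex 4 the value accumulates the lengths along the chain 2, 3, 4, which is the stretch of trajectory that becomes labeled if vertex 4 turns out to belong to the unlabeled origin.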

Complexity: The computation of the score is linear in $|P(v)|$ - which is expected to be small for each visible target - instead of $O(|E| + |V|)$ for the graph traversal that is necessary without the auxiliary data. Note that without untangling, the graph is expected to grow very fast as more targets enter the scene and many tracklets are detected, making the alternative $O(|E| + |V|)$ even worse. Untangling incurs an overhead, since the auxiliary data of all the descendant vertices must be updated. In the worst case, this requires updating $O(|V|)$ vertices. However, this operation is rarely performed. Moreover, each time it takes place, the size of the graph is greatly reduced. Hence, the amortized complexity of updating the graph is expected to be $O(1)$ for each new vertex. A formal proof of this conjecture is left for future study.

4. Experiments 

We used simulated data as an input to our method to evaluate the scheduler’s performance independently from that of the other modules. Simulated data also makes it possible to bypass the limitations of comparing online algorithms on the same real data; each algorithm dictates different zoom-in operations, thus changing the data. We implemented our method as well as the simulated data using Matlab.

Figure 6. Screen shots of simulation C. (a) Trajectory trails. (b) The final MSG of our method. (c) The final MSG of our method when untangling is not used. (d) The final MSG of the naive scheduler. Asterisk: a labeled vertex.

The simulated scene consists of a set of targets walking on a grid of intersecting diagonal roads. The targets’ velocities (speed and direction), entrance time and location, and the probability that meeting targets start walking together, are determined randomly. All targets have the same low-resolution appearance to increase ambiguity, and low-resolution images are not used for matching.

We present the objective score of our algorithm, $M$ (Eq. 1), as a function of the expected ambiguities in the scene. It is computed when the simulation ends and is based on the MSG’s tracklets and on the ground truth. We estimate the ambiguities of the scene by $N_{j/s} = \sum_{z \in Z} n_{j/s}(z)$, where $n_{j/s}(z)$ is the number of times a solo-walking target, $z$, joins a group and then splits to walk alone again. Note that in practice blind gaps may cause additional ambiguities. For comparison we consider a naive scheduler [2] that selects the unlabeled target predicted to leave the scene first. In both cases, we assume that the tracker provides the necessary available information (e.g., whether the face is expected to be captured successfully).

A simple simulation of 3 targets and one join/split event (Figure 4(a)) demonstrates a scenario where our scheduler selects the labeled target to be one of the two targets that are predicted by the tracker to join, before this event occurs. Consequently, one additional labeling after the split event untangles the MSG into an ideal graph, and all the trajectories are fully recovered (Figure 4(b)). When our scheduler is used without the untangling process, its final MSG is not ideal and a full recovery is not achieved (Figure 4(c)). The naive scheduler selects the joining targets for labeling only after they split, thus preventing a full recovery and achieving the lowest $M$ (Figure 4(d)).

Figure 5(a) presents an example with a large number of targets and ambiguities. The MSG grows rapidly but our scheduler achieves untangling at key points (see Figure 5(b-f)), allowing the final MSG to contain only one remaining ambiguity (Figure 5(g)). The naive scheduler achieves a significantly lower $M$ due to a final MSG with many unresolved ambiguities (Figure 5(h)). Another complex example is presented in Figure 6.

The results on 414 simulations are presented in Figure 7. When $N_{j/s}$ is small, the performance of our and the naive algorithms is close to perfect. When $N_{j/s}$ increases, the performance of our method decreases, mainly due to the limited time available to label all the desired targets. However, for moderate ambiguity of $N_{j/s} = 15$, our method still performs well: $M > 0.85$. The superiority of our algorithm over the naive one is apparent both for moderate and high $N_{j/s}$. For example, the score of the naive algorithm obtained for $N_{j/s} = 15$ is $M = 0.55$, which is lower than the worst score of our method, for $N_{j/s} > 30$.

Two components of our algorithm contribute to its superiority over the naive one. The global view we keep of the system state allows us to associate one or more labeled tracklets of the same target with additional tracklets of that target. Using graph terminology, this corresponds to the untangling and merging of vertices, either by direct labeling or as a byproduct of labeling other targets. In addition, our scheduling method explicitly considers the task of disambiguating tracklet associations, and uses global information of the current state of the system efficiently.

5. Discussion & Future Work 

We proposed a method for tracking multiple pedestrians 
and capturing their faces using a single PTZ camera. The 
goal of the system is to maximize the length of the labeled 
trajectories recovered by the tracker. Our main contribution 
is a novel data structure, MSG, that efficiently utilizes all 
the available global information of a tracking system. The 
auxiliary data of the MSG is used for an efficient scheduling 
algorithm that resolves or prevents tracklet ambiguities and 
matches tracklets directly or indirectly via target labeling. 
The MSG may be modified for various applications that use 
several cameras, with or without overlapping fields of view, 
when two distinct resolution levels can be used for resolving 
ambiguities. This is left for future research. 

Our method aims to represent and efficiently use the data available from basic components of trackers and recognition systems, most of which are assumed to be deterministic for ease of exposition. It is clearly prone to the expected errors of each of these components.

Our method can be extended to handle a probabilistic setting where each component provides a degree of confidence for its output. This can be integrated into the graph by, for example, associating a weight with each edge. In the current system, X-type edges can represent the output of a probabilistic person-to-person matching algorithm. A threshold on the face-to-face matching confidence may be used for deciding whether to untangle the graph or wait for additional information.

Acknowledgment: This research was supported by the Israeli Ministry of Science, grant no. 3-8700, and by Award No. 2011-IJ-CX-K054, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice.

Figure 7. Results of all simulations. $M$ as a function of $N_{j/s}$.

References

[1] A. Bagdanov, A. del Bimbo, and F. Pernici. Acquisition of high-resolution images through on-line saccade sequence planning. In VSSN, 2005.

[2] Y. Cai, G. Medioni, and T. Dinh. Towards a practical PTZ 
face detection and tracking system. In WACV, 2013. 2, 5, 7 

[3] C. Costello, C. Diehl, A. Banerjee, and H. Fisher. Scheduling 
an active camera to observe people. In VSSN, 2004. 2 

[4] C. Costello and I. Wang. Surveillance camera coordination 
through distributed scheduling. In CDC-ECC, 2005. 2 

[5] J. Henriques, R. Caseiro, and J. Batista. Globally optimal 
solution to multi-object tracking with merged measurements. 
In ICCV, 2011. 2 

[6] S. Lim, L. Davis, and A. Mittal. Constructing task visibility 
intervals for video surveillance. MS, 12(3), 2006. 2 

[7] A. Morye, C. Ding, A. Roy-Chowdhury, and J. Farrell. Distributed 
constrained optimization for Bayesian opportunistic 
visual sensing. TCST, 2014. 2 

[8] P. Nillius, J. Sullivan, and S. Carlsson. Multi-target tracking - 
linking identities using Bayesian network inference. In 
CVPR, 2006. 2 

[9] J. Prokaj, M. Duchaineau, and G. Medioni. Inferring tracklets 
for multi-object tracking. In CVPRW, 2011. 2 

[10] F. Qureshi and D. Terzopoulos. Surveillance in virtual reality: 
System design and multi-camera control. In CVPR, 
2007. 2 

[11] P. Salvagnini, F. Pernici, M. Cristani, G. Lisanti, I. Masi, 
A. Del Bimbo, and V. Murino. Information theoretic sensor 
management for multi-target tracking with a single pan-tilt-zoom 
camera. In WACV, 2014. 2 

[12] E. Sommerlade and I. Reid. Information-theoretic active 
scene exploration. In CVPR, 2008. 2 

[13] T. Strat, P. Arambel, M. Antone, C. Rago, and H. Landan. A 
multiple-hypothesis tracking of multiple ground targets from 
aerial video with dynamic sensor control. In SPIE, 2004. 2 

[14] J. Sullivan and S. Carlsson. Tracking and labelling of interacting 
multiple targets. In ECCV, 2006. 2, 3 

[15] X. Wang, E. Türetken, F. Fleuret, and P. Fua. Tracking interacting 
objects optimally using integer programming. In 
ECCV, 2014. 2 

[16] C. Ward and M. Naish. Scheduling active camera resources 
for multiple moving targets. In CCECE, 2009. 2 

[17] Z. Wu, T. Kunz, and M. Betke. Efficient track linking methods 
for track graphs using network-flow and set-cover techniques. 
In CVPR, 2011. 2 

[18] B. Yang and R. Nevatia. An online learned CRF model for 
multi-target tracking. In CVPR, 2012. 2 

Appendix A 

This appendix provides the recursive computations of 
$n_o(v)$ and $\Delta L_{\mathrm{dir}}(v)$, which are used for the labeling 
score computation of Section 3.5. 


Recursive computation of $n_o(v)$: The number of origins 
of each vertex, $n_o(v)$, is recursively defined by: 

$$
n_o(v) =
\begin{cases}
1 & v \in V_L \ \text{or}\ |P(v)| = 0 \\
\sum_{v_i \in P(v)} n_o(v_i) & \text{otherwise.}
\end{cases}
\tag{8}
$$

Note that using $n_{\neg L}(v)$ (given in Eq. 2 of Section 3.3) and 
$n_o(v)$, we can also recursively compute the number of 
labeled origins of $v$: 

$$
n_L(v) = n_o(v) - n_{\neg L}(v). \tag{9}
$$
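
The recursion for the number of origins can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the graph is given as a hypothetical `parents` dictionary, the labeled set `labeled` stands for V_L, and the count of unlabeled origins (`n_not_l`, the quantity from Eq. 2 of Section 3.3) is supplied by the caller with hand-picked values because its definition lies outside this appendix.

```python
def count_origins(parents, labeled):
    """Return n_o(v) for every vertex, following the recursion of Eq. 8.

    parents: dict mapping each vertex to the list of its parents P(v)
    labeled: set of labeled vertices (V_L)
    """
    memo = {}

    def n_o(v):
        if v not in memo:
            if v in labeled or not parents[v]:
                # Labeled vertices and vertices without parents count as one origin.
                memo[v] = 1
            else:
                memo[v] = sum(n_o(p) for p in parents[v])
        return memo[v]

    return {v: n_o(v) for v in parents}


def count_labeled_origins(n_o, n_not_l):
    """Return n_L(v) = n_o(v) - n_notL(v) for every vertex (Eq. 9)."""
    return {v: n_o[v] - n_not_l[v] for v in n_o}


# Toy graph (hypothetical): a and b are sources, c merges them, d and e follow c.
parents = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["c"]}
labeled = {"b", "e"}
n_o = count_origins(parents, labeled)

# Assumed output of Eq. 2 for this toy graph (hand-picked for illustration).
n_not_l = {"a": 1, "b": 0, "c": 1, "d": 1, "e": 0}
n_l = count_labeled_origins(n_o, n_not_l)
```

Memoization means each vertex is evaluated once, using only its own status and the stored values of its direct parents, which matches the intent of keeping auxiliary data in each vertex.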


Recursive computation of $\Delta L_{\mathrm{dir}}(v)$: Let us first consider 
the path $\ell(u,v)$, where $u$ is a labeled origin of $v$. Its 
contribution to $\Delta L_{\mathrm{dir}}(v)$ consists of $|\tau(\ell(u,v))| - |\tau(u)|$, since 
$u$ is a labeled vertex prior to the labeling of $v$. Consider 
$w \in \ell(u,v) \cap P(v)$, that is, the parent of $v$ on the path 
$\ell(u,v)$. It is possible to decompose $|\tau(\ell(u,v))|$ into the 
sum: $|\tau(\ell(u,v))| = |\tau(\ell(u,w))| + |\tau(v)|$. It follows that 
$v$ contributes $|\tau(v)|$ for each of its possible direct matchings, 
that is, $n_L(v) \cdot |\tau(v)|$. In addition, the value $\Delta L_{\mathrm{dir}}(v)$ 
consists of the sum of $\Delta L_{\mathrm{dir}}(v_i)$ for each of the parents of 
$v$ on possible direct match paths. Hence, $\Delta L_{\mathrm{dir}}(v)$ can be 
recursively computed: 

$$
\Delta L_{\mathrm{dir}}(v) =
\begin{cases}
|\tau(v)| \cdot n_L(v) + \sum_{v_i \in P(v)} \Delta L_{\mathrm{dir}}(v_i) & v \notin V_L \\
0 & v \in V_L.
\end{cases}
\tag{10}
$$
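
As a minimal sketch of the recursion in Eq. 10, the following Python function evaluates the direct labeling gain for every vertex of a small graph. The dictionary-based graph representation, the function name, and the toy tracklet lengths are assumptions made for illustration; the number of labeled origins per vertex is passed in as `n_l`.

```python
def delta_l_dir(parents, labeled, tau_len, n_l):
    """Return the direct labeling gain of every vertex (Eq. 10).

    parents: dict mapping each vertex to the list of its parents P(v)
    labeled: set of labeled vertices (V_L)
    tau_len: dict mapping each vertex to its tracklet length |tau(v)|
    n_l:     dict mapping each vertex to its number of labeled origins
    """
    memo = {}

    def delta(v):
        if v not in memo:
            if v in labeled:
                # A labeled vertex contributes nothing further.
                memo[v] = 0
            else:
                memo[v] = tau_len[v] * n_l[v] + sum(delta(p) for p in parents[v])
        return memo[v]

    return {v: delta(v) for v in parents}


# Toy graph (hypothetical): b is labeled; c merges a and b; d follows c.
parents = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
labeled = {"b"}
tau_len = {"a": 5, "b": 4, "c": 3, "d": 2}
n_l = {"a": 0, "b": 1, "c": 1, "d": 1}
gain = delta_l_dir(parents, labeled, tau_len, n_l)
```

In this toy graph the only labeled origin of d is b, and the path from b to d has total length 4 + 3 + 2 = 9. Subtracting the already labeled length of b (4) leaves 5, which agrees with the recursive value computed for d.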


Let us consider again the path $\ell(u,v)$, where $u$ is a labeled 
origin of $v$. We wish to find the contribution of $u$ 
to $\Delta L_{\mathrm{dir}}(v)$ when considering not only $u$ itself but also its 
forward labeling extension. This contribution excludes the 
entire extension of $u$, which is labeled prior to the labeling 
of $v$. The length of the forward labeling extension of $u$, 
$|\tau(\alpha(u))|$, is therefore subtracted from $|\tau(\ell(u,v))|$. That is, 
the contribution of the possible matching of $u$ to $v$ is given 
by $|\tau(\ell(u,v))| - |\tau(\alpha(u))|$. We next describe the auxiliary 
data needed to compute the refined $\Delta L_{\mathrm{dir}}(v)$ efficiently. 

For each vertex $v$, we define the number of forward labeling 
chains in which $v$ is included, $n_{\mathrm{ret}}(v)$. This value 
can be recursively computed based only on the vertex itself 
and its direct parents, as follows: 

$$
n_{\mathrm{ret}}(v) =
\begin{cases}
\sum_{v_i \in P(v)} n_{\mathrm{ret}}(v_i) \cdot \mathrm{chain}(v_i) & v \notin V_L \\
1 & v \in V_L,
\end{cases}
\tag{11}
$$

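where the Boolean function $\mathrm{chain}(v)$ determines whether 
$v$ has only one child. The refined recursive computation 
of $\Delta L_{\mathrm{dir}}(v)$ (which replaces Eq. 10 above) is given by: 

$$
\Delta L_{\mathrm{dir}}(v) =
\begin{cases}
|\tau(v)| \cdot \bigl(n_L(v) - n_{\mathrm{ret}}(v)\bigr) + \sum_{v_i \in P(v)} \Delta L_{\mathrm{dir}}(v_i) & v \notin V_L \\
0 & v \in V_L.
\end{cases}
\tag{12}
$$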
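
The refined recursion can be sketched in Python as below. This is an illustrative reading of Eqs. 11 and 12 under stated assumptions: the chain count of an unlabeled vertex is taken to be a sum over its parents, the Boolean chain test is implemented as "has exactly one child", and the graph encoding, function names, and toy values are hypothetical.

```python
def children_of(parents):
    """Invert a parent map into a child map."""
    children = {v: [] for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].append(v)
    return children


def refined_delta_l_dir(parents, labeled, tau_len, n_l):
    """Return the refined direct labeling gain of every vertex (Eqs. 11 and 12)."""
    children = children_of(parents)
    ret_memo, delta_memo = {}, {}

    def chain(v):
        # True when v has only one child.
        return len(children[v]) == 1

    def n_ret(v):
        # Number of forward labeling chains that include v (Eq. 11).
        if v not in ret_memo:
            if v in labeled:
                ret_memo[v] = 1
            else:
                ret_memo[v] = sum(n_ret(p) for p in parents[v] if chain(p))
        return ret_memo[v]

    def delta(v):
        # Refined gain: matchings already covered by a chain are excluded (Eq. 12).
        if v not in delta_memo:
            if v in labeled:
                delta_memo[v] = 0
            else:
                own = tau_len[v] * (n_l[v] - n_ret(v))
                delta_memo[v] = own + sum(delta(p) for p in parents[v])
        return delta_memo[v]

    return {v: delta(v) for v in parents}


# Case 1: a simple chain b -> c -> d with b labeled; the label extends forward.
chain_gain = refined_delta_l_dir(
    parents={"b": [], "c": ["b"], "d": ["c"]},
    labeled={"b"},
    tau_len={"b": 4, "c": 3, "d": 2},
    n_l={"b": 1, "c": 1, "d": 1},
)

# Case 2: b has two children, so its label cannot be extended to either of them.
fork_gain = refined_delta_l_dir(
    parents={"b": [], "c1": ["b"], "c2": ["b"]},
    labeled={"b"},
    tau_len={"b": 4, "c1": 3, "c2": 6},
    n_l={"b": 1, "c1": 1, "c2": 1},
)
```

In the chain case every vertex already lies on the forward labeling chain of b, so labeling c or d yields no additional gain. In the fork case the extension is blocked and each child keeps its full tracklet length as gain.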

Note that the labeling of a vertex can also be extended 
backwards, in a manner similar to the forward labeling extension. 
Both extensions are considered in the experiments 
for the evaluation of our scheduler, but only the forward 
labeling extension is useful for the refined $\Delta L_{\mathrm{dir}}(v)$ 
computation. 


Refinement of the $\Delta L_{\mathrm{dir}}(v)$ computation: The labeled 
tracklet of each labeled origin $u$ of $v$ clearly consists of 
the tracklet represented by $u$ itself, $\tau(u)$. In addition, the 
labeling of a vertex can sometimes be extended to label 
its parents and children as well. For example, assume that $v_1$ of a 
target $z$ is labeled in Figure 2(b). The tracklet of the vertex that 
succeeds $v_1$ clearly follows $\tau(v_1)$ for this target, and is 
therefore an extension of $\tau(v_1)$. Formally, let $v_L$ be a labeled 
vertex with a single compound child, $w$. The tracklets $\tau(v_L)$ 
and $\tau(w)$ represent the same target. Hence, $\tau(v_L)$ can be 
extended to $\tau(w)$. As a result, the length of the labeled 
tracklets is given by $|\tau(v_L)| + |\tau(w)|$. Such a forward labeling 
extension can be applied recursively to any forward labeling 
chain $\alpha(v_L)$, which is a path from $v_L$ in which each vertex is 
a single child of its parent. $\Delta L_{\mathrm{dir}}(v)$ can be estimated more 
accurately by considering forward labeling extensions of the 
labeled origins of $v$, as described next. 
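
A forward labeling chain can be traced with a short loop, sketched here in Python. The function names, the child-map representation, and the toy tracklet lengths are assumptions for illustration. The sketch follows single-child links only and ignores the distinction between compound and other vertex types.

```python
def forward_labeling_chain(v_l, children):
    """Return the path from v_l in which each vertex is the single child of its parent.

    children: dict mapping each vertex to the list of its children
    """
    chain = [v_l]
    current = v_l
    # Keep extending while the current vertex has exactly one child.
    while len(children[current]) == 1:
        current = children[current][0]
        chain.append(current)
    return chain


def labeled_length(chain, tau_len):
    """Total tracklet length covered by a label extended along the chain."""
    return sum(tau_len[v] for v in chain)


# Toy graph (hypothetical): v1 -> w -> x, then x splits into y and z.
children = {"v1": ["w"], "w": ["x"], "x": ["y", "z"], "y": [], "z": []}
tau_len = {"v1": 4, "w": 3, "x": 2, "y": 1, "z": 1}
chain = forward_labeling_chain("v1", children)
total = labeled_length(chain, tau_len)
```

The chain stops at x because x has two children, so the label of v1 covers the tracklets of v1, w, and x.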


